1. Presentation of the case


2. Data exploration

2.1. Data source

As mentioned above, the main data set comes from the opendata.swiss portal, where it is published by the Open Data Portal of the City Council of Zürich under the name “Hundebestände der Stadt Zürich, seit 2015”. The original source describes the data set as follows:

This dataset contains information on dogs and their owners from the municipal dog register since 2015. Information on the age group, gender and statistical district of residence is provided for dog owners. The breed, breed type, sex, year of birth, age and color are recorded for each dog. The dog register is kept by the Dog Control Department of the Zurich City Police.

For the sake of a seamless workflow and easier interpretation of the variables within our group, the names of columns as well as certain string values have been translated to English from the original German version.
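The translation step can be sketched with a named lookup vector. This is only an illustration on a toy data frame: the German column names used here (StichtagDatJahr, HalterId, SexHundLang) are assumptions and may not match the raw file, and the English value labels are likewise hypothetical.

```r
# Toy stand-in for the raw file; the German column names are assumed
# for illustration and may differ from the actual CSV header.
raw <- data.frame(
  StichtagDatJahr = c(2015L, 2015L),
  HalterId        = c(126L, 574L),
  SexHundLang     = c("weiblich", "männlich"),
  stringsAsFactors = FALSE
)

# Lookup vector: names are the original columns, values the English names.
col_map <- c(
  StichtagDatJahr = "ReferenceYear",
  HalterId        = "OwnerId",
  SexHundLang     = "DogSexText"
)

df <- raw
names(df) <- unname(col_map[names(raw)])

# String values can be translated with the same lookup pattern
# (the English labels are hypothetical).
sex_map <- c("weiblich" = "female", "männlich" = "male")
df$DogSexText <- unname(sex_map[df$DogSexText])
```

Keeping the mapping in one vector makes the renaming easy to review and to extend when further columns are translated.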

The main source of data is the kul100od1001.csv file, which contains 70,967 records with 33 variables. The working data frame df.dogs shown below retains 15 of these variables.

dim(df.dogs)
## [1] 70967    15
str(df.dogs)
## 'data.frame':    70967 obs. of  15 variables:
##  $ ReferenceYear : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ OwnerId       : int  126 574 695 893 1177 4004 4050 4155 4203 4215 ...
##  $ AgeV10Text    : chr  "60- bis 69-Jährige" "60- bis 69-Jährige" "40- bis 49-Jährige" "60- bis 69-Jährige" ...
##  $ OwnerSexText  : chr  "männlich" "weiblich" "männlich" "weiblich" ...
##  $ DistrictText  : chr  "Kreis 9" "Kreis 2" "Kreis 6" "Kreis 7" ...
##  $ QuarterText   : chr  "Altstetten" "Leimbach" "Oberstrass" "Fluntern" ...
##  $ Breed1Text    : chr  "Welsh Terrier" "Cairn Terrier" "Labrador Retriever" "Mittelschnauzer" ...
##  $ Breed2Text    : chr  "Keine" "Keine" "Keine" "Keine" ...
##  $ MixedBreedText: chr  "Rassehund" "Rassehund" "Rassehund" "Rassehund" ...
##  $ BreedTypeLong : chr  "Kleinwüchsig" "Kleinwüchsig" "Rassentypenliste I" "Rassentypenliste I" ...
##  $ DogBirthYear  : int  2011 2002 2012 2010 2011 2010 2012 2002 2005 2001 ...
##  $ DogAgeCoded   : int  3 12 2 4 3 4 2 12 9 13 ...
##  $ DogSexText    : chr  "weiblich" "weiblich" "weiblich" "weiblich" ...
##  $ DogColorText  : chr  "schwarz/braun" "brindle" "braun" "schwarz" ...
##  $ NumberOfDogs  : int  1 1 1 1 1 1 1 1 1 1 ...

As the structure of the data shows, the set comprises variables of several data types. Most variables appear in three variants: two integer forms (Coded and Sort) and a string form (Text). For each variable we selected the variant that best fits its use in the study. Our selection of relevant variables can be summarized as follows:

Numerical values:

  • ReferenceYear: numerical value for the reference year
  • OwnerId: numerical identifier for the owner of the registered dog
  • AgeV10Sort: referring to the owner’s age as a 10-year category
  • DogBirthYear: numerical value for the birth year of the dog
  • DogAgeSort: referring to the dog’s age at the time of registration
  • NumberOfDogs: numerical count of dogs for each dog owner

Binary variables:

  • DogSexText: string value with two levels indicating the biological sex of the dog

String values:

  • DistrictText: the name of each larger district of Zürich according to the official division
  • QuarterText: the name of the smaller neighbourhoods which comprise the larger districts
  • Breed1Text and Breed2Text: the primary and secondary breed names of the dog
  • MixedBreedText: additional information on whether the dog is a pure or mixed breed
  • DogColorText: a descriptive name for the colour of the dog
  • BreedTypeLong: referring to the official dog type classification according to the Zürich Cantonal Law

The original data set has been complemented with the GeoJSON file stzh.adm_stadtkreise_a.geojson for the production of map plots. Both data sets are merged on the district name variables, following the district division defined by the City Council of Zürich.
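A minimal sketch of such a merge, using base R on toy tables, is given here. The columns district_name and geometry_id are hypothetical stand-ins for the attribute table of the GeoJSON layer, and the counts are made up; in practice the layer would be read with a spatial package and joined in the same way.

```r
# Toy per-district summary derived from the dog register (made-up counts).
dog_counts <- data.frame(
  DistrictText = c("Kreis 1", "Kreis 2", "Kreis 3"),
  DogCount     = c(120L, 540L, 610L)
)

# Toy stand-in for the attribute table of the district layer.
# 'district_name' and 'geometry_id' are hypothetical column names.
districts <- data.frame(
  district_name = c("Kreis 1", "Kreis 2", "Kreis 3", "Kreis 4"),
  geometry_id   = 1:4
)

# Left join on the district name: every polygon is kept,
# districts without a matching count get NA.
map_data <- merge(districts, dog_counts,
                  by.x = "district_name", by.y = "DistrictText",
                  all.x = TRUE)
```

Using all.x = TRUE keeps districts that have no matching rows in the register, so they still appear on the map (with a missing value) instead of silently disappearing.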


2.3. Data insights and research questions

Because the dataset consists predominantly of categorical variables, with only a few quantitative ones, we organise the exploratory analysis around count-based groupings. We then match each model in our study to the research questions and variables that suit it best. The following insights and plots give a first look at the dataset and point to possible research questions.
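As an illustration of a count-based grouping, the sketch below sums NumberOfDogs per reference year and dog sex on a small toy data frame. The rows are made up and the output column name DogCount is chosen for illustration; the same pattern applies to the other groupings in this section.

```r
# Toy excerpt with the same column names as df.dogs (made-up rows).
dogs <- data.frame(
  ReferenceYear = c(2015L, 2015L, 2015L, 2016L, 2016L),
  DogSexText    = c("weiblich", "weiblich", "männlich", "weiblich", "männlich"),
  NumberOfDogs  = c(1L, 1L, 1L, 1L, 1L)
)

# Sum the dog counter within each year/sex combination.
counts <- aggregate(NumberOfDogs ~ ReferenceYear + DogSexText,
                    data = dogs, FUN = sum)

# Rename the aggregated column for readability.
names(counts)[names(counts) == "NumberOfDogs"] <- "DogCount"
counts
```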

2.3.1. Count by owner sex by year


2.3.2. Count by dog sex and year


2.3.3. Top 10 count by dog breed


2.3.4. Overall count by owner age group


2.3.5. Average dog age per district


2.4. Correlation plot


3. Models

3.1. Linear Model: dog count by district over time

ggplotly(fit01_ggplot)
## `geom_smooth()` using formula = 'y ~ x'
lm.counts.year <- lm(DogCount ~ ReferenceYear * DistrictText,
                     data = dog_count_per_neighborhood_year)
summary(lm.counts.year)
## 
## Call:
## lm(formula = DogCount ~ ReferenceYear * DistrictText, data = dog_count_per_neighborhood_year)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.572 -16.100  -0.033  16.181  58.222 
## 
## Coefficients:
##                                                      Estimate Std. Error
## (Intercept)                                        -5.527e+03  7.289e+03
## ReferenceYear                                       2.800e+00  3.610e+00
## DistrictTextKreis 10                               -3.312e+04  1.031e+04
## DistrictTextKreis 11                               -1.047e+05  1.031e+04
## DistrictTextKreis 12                               -3.837e+04  1.031e+04
## DistrictTextKreis 2                                -9.000e+04  1.031e+04
## DistrictTextKreis 3                                -5.168e+04  1.031e+04
## DistrictTextKreis 4                                -2.623e+04  1.031e+04
## DistrictTextKreis 5                                -3.056e+04  1.031e+04
## DistrictTextKreis 6                                -3.980e+04  1.031e+04
## DistrictTextKreis 7                                -7.468e+04  1.031e+04
## DistrictTextKreis 8                                -2.999e+04  1.031e+04
## DistrictTextKreis 9                                -8.681e+04  1.031e+04
## DistrictTextUnbekannt (Stadt Zürich)                5.263e+03  1.170e+04
## ReferenceYear:DistrictTextKreis 10                  1.670e+01  5.106e+00
## ReferenceYear:DistrictTextKreis 11                  5.245e+01  5.106e+00
## ReferenceYear:DistrictTextKreis 12                  1.922e+01  5.106e+00
## ReferenceYear:DistrictTextKreis 2                   4.488e+01  5.106e+00
## ReferenceYear:DistrictTextKreis 3                   2.588e+01  5.106e+00
## ReferenceYear:DistrictTextKreis 4                   1.313e+01  5.106e+00
## ReferenceYear:DistrictTextKreis 5                   1.520e+01  5.106e+00
## ReferenceYear:DistrictTextKreis 6                   1.992e+01  5.106e+00
## ReferenceYear:DistrictTextKreis 7                   3.747e+01  5.106e+00
## ReferenceYear:DistrictTextKreis 8                   1.500e+01  5.106e+00
## ReferenceYear:DistrictTextKreis 9                   4.342e+01  5.106e+00
## ReferenceYear:DistrictTextUnbekannt (Stadt Zürich) -2.668e+00  5.798e+00
##                                                    t value Pr(>|t|)    
## (Intercept)                                         -0.758 0.450412    
## ReferenceYear                                        0.776 0.440152    
## DistrictTextKreis 10                                -3.213 0.001858 ** 
## DistrictTextKreis 11                               -10.157 2.49e-16 ***
## DistrictTextKreis 12                                -3.722 0.000354 ***
## DistrictTextKreis 2                                 -8.731 1.89e-13 ***
## DistrictTextKreis 3                                 -5.013 2.89e-06 ***
## DistrictTextKreis 4                                 -2.545 0.012742 *  
## DistrictTextKreis 5                                 -2.965 0.003933 ** 
## DistrictTextKreis 6                                 -3.861 0.000220 ***
## DistrictTextKreis 7                                 -7.244 1.83e-10 ***
## DistrictTextKreis 8                                 -2.910 0.004618 ** 
## DistrictTextKreis 9                                 -8.421 8.01e-13 ***
## DistrictTextUnbekannt (Stadt Zürich)                 0.450 0.654061    
## ReferenceYear:DistrictTextKreis 10                   3.271 0.001550 ** 
## ReferenceYear:DistrictTextKreis 11                  10.273  < 2e-16 ***
## ReferenceYear:DistrictTextKreis 12                   3.764 0.000307 ***
## ReferenceYear:DistrictTextKreis 2                    8.791 1.43e-13 ***
## ReferenceYear:DistrictTextKreis 3                    5.070 2.30e-06 ***
## ReferenceYear:DistrictTextKreis 4                    2.572 0.011839 *  
## ReferenceYear:DistrictTextKreis 5                    2.977 0.003790 ** 
## ReferenceYear:DistrictTextKreis 6                    3.901 0.000191 ***
## ReferenceYear:DistrictTextKreis 7                    7.338 1.19e-10 ***
## ReferenceYear:DistrictTextKreis 8                    2.938 0.004252 ** 
## ReferenceYear:DistrictTextKreis 9                    8.504 5.46e-13 ***
## ReferenceYear:DistrictTextUnbekannt (Stadt Zürich)  -0.460 0.646508    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.96 on 85 degrees of freedom
## Multiple R-squared:  0.9953, Adjusted R-squared:  0.9939 
## F-statistic: 718.6 on 25 and 85 DF,  p-value: < 2.2e-16
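To read the interaction terms, recall that with R's default treatment contrasts the yearly change for a district is the ReferenceYear main effect plus that district's interaction coefficient. The reference district appears to be Kreis 1, the only level without its own coefficient. The sketch below recomputes a few district slopes from coefficients copied from the summary above; the object names are chosen for illustration.

```r
# Coefficients copied from the model summary (subset for brevity).
base_slope <- 2.800   # ReferenceYear main effect, i.e. slope of the reference district

# Interaction terms ReferenceYear:DistrictText for selected districts.
interaction <- c(
  "Kreis 10" = 16.70,
  "Kreis 11" = 52.45,
  "Kreis 2"  = 44.88,
  "Kreis 9"  = 43.42
)

# District-specific yearly change in dog count.
district_slope <- base_slope + interaction
district_slope
```

Under this reading, Kreis 11 gains roughly 55 dogs per year, against about 3 per year in the reference district. The very large intercept and district main effects arise because ReferenceYear is not centred, so they describe an extrapolation to year 0 and should not be interpreted on their own.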

3.2. Generalized Linear Model (Poisson)


3.3. Generalized Linear Model (Binomial)

# Display summary of the model
#summary(binomial_model)

3.4. Generalized Additive Model (Binomial)


3.5. Neural Network


3.6. Support Vector Machine Model


3.7. Optimisation Problem


4. Additional chapter


5. Conclusion


6. Appendix: Working with generative AI tools